We introduce a stochastic graph-based method for computing the relative importance of textual units for Natural Language Processing. We test the technique on the problem of Text Summarization (TS). Extractive TS relies on the concept of sentence salience to identify the most important sentences in a document or set of documents. Salience is typically defined in terms of the presence of particular important words or in terms of similarity to a centroid pseudo-sentence. We consider a new approach, LexRank, for computing sentence importance based on the concept of eigenvector centrality in a graph representation of sentences. In this model, a connectivity matrix based on intra-sentence cosine similarity is used as the adjacency matrix of the graph representation of sentences. Our system, based on LexRank, ranked in first place in more than one task in the recent DUC 2004 evaluation. In this paper, we present a detailed analysis of our approach and apply it to a larger data set including data from earlier DUC evaluations. We discuss several methods to compute centrality using the similarity graph. The results show that degree-based methods (including LexRank) outperform both centroid-based methods and other systems participating in DUC in most cases. Furthermore, the LexRank with threshold method outperforms the other degree-based techniques, including continuous LexRank. We also show that our approach is quite insensitive to noise in the data that may result from an imperfect topical clustering of documents.
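To make the thresholded variant concrete, the following is a minimal sketch of the core idea: build a sentence similarity graph, keep only edges above a threshold, and score sentences by a PageRank-style eigenvector centrality. This is an illustration under simplifying assumptions, not the paper's implementation: it uses raw term-frequency cosine in place of the IDF-modified cosine, naive whitespace tokenization, and the function names and defaults (threshold=0.1, damping=0.85) are our own choices.

    import math
    from collections import Counter

    def cosine(a, b):
        """Cosine similarity between two bag-of-words Counters."""
        dot = sum(v * b.get(w, 0) for w, v in a.items())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def lexrank_threshold(sentences, threshold=0.1, damping=0.85,
                          tol=1e-6, max_iter=100):
        """Score sentences by eigenvector centrality of the thresholded
        similarity graph (a sketch of "LexRank with threshold")."""
        bags = [Counter(s.lower().split()) for s in sentences]
        n = len(bags)
        # Binary adjacency: keep an edge only above the similarity threshold.
        adj = [[1.0 if i != j and cosine(bags[i], bags[j]) > threshold else 0.0
                for j in range(n)] for i in range(n)]
        # Row-normalize into a stochastic matrix; give dangling rows a self-loop.
        stoch = []
        for i, row in enumerate(adj):
            s = sum(row)
            if s == 0:
                row, s = [1.0 if j == i else 0.0 for j in range(n)], 1.0
            stoch.append([v / s for v in row])
        # Power iteration with damping, as in PageRank.
        p = [1.0 / n] * n
        for _ in range(max_iter):
            new_p = [(1 - damping) / n
                     + damping * sum(p[i] * stoch[i][j] for i in range(n))
                     for j in range(n)]
            if sum(abs(x - y) for x, y in zip(new_p, p)) < tol:
                return new_p
            p = new_p
        return p

    # Toy usage: the two overlapping sentences should outrank the outlier.
    sents = ["the cat sat on the mat",
             "a cat was sitting on the mat",
             "stock prices fell sharply today"]
    for score, s in sorted(zip(lexrank_threshold(sents), sents), reverse=True):
        print(f"{score:.3f}  {s}")

Continuous LexRank differs only in the adjacency step: instead of binarizing edges above a threshold, it keeps the raw cosine weights before row-normalizing.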